Strategies for Effective Chemical Information Retrieval

Authors

  • Suleyman Cetintas
  • Luo Si
Abstract

We participated in the technology survey and prior art search subtasks of the TREC 2009 Chemical IR Track. This paper describes the methods developed for these two tasks. For the technology survey task, we propose a method that constructs highly structured queries to perform weighted retrieval over different fields of chemical patents and documents. The proposed method i) enriches these structured queries with synonyms of the chemicals that have been identified, and ii) uses simple entity recognition to extract information for increasing or decreasing the weights of some terms and for filtering documents out of the ranked list. For the prior art search task, we propose an automated query generation method that uses all title words and selects sets of terms from the claims, abstract and description fields of a query patent to transform it into a search query. Chemical entities are extracted from the selected terms, and synonyms for the identified chemical entities are added from PubChem. Structured queries are then formed to retrieve over different fields of documents with different weights. Furthermore, a post-processing step is proposed that i) filters some of the retrieved documents out of the ranked list because of date constraints and ii) uses the IPC similarities between the query patent and its retrieved patents to re-rank the retrieved documents. Empirical results demonstrate the effectiveness of these methods in both tasks.

1. INTRODUCTION

This paper describes the approaches used by members of Purdue University for the technology survey and prior art search subtasks of the TREC 2009 Chemical IR Track. The Indri search engine¹ was used to index and retrieve various fields of documents, and its rich query language was exploited because it supports structured queries, handles synonyms, etc.
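As a rough illustration of the kind of structured query described above, the following Python sketch assembles a query string from the Indri operators `#weight`, `#combine`, `#syn` and `#1`, with per-term field restriction (`term.field`). It is a minimal sketch and not the authors' implementation: the helper names, the field names (`title`, `claims`), the weights and the synonym table are all hypothetical, and terms are assumed to be already tokenized into plain alphanumeric words.

```python
from typing import Dict, List


def term_expr(term: str, field: str) -> str:
    """Render one term or phrase restricted to a field.

    A multi-word phrase becomes an exact ordered window, e.g.
    'acetic acid' in 'title' -> #1(acetic.title acid.title).
    """
    tokens = [f"{tok}.{field}" for tok in term.lower().split()]
    return tokens[0] if len(tokens) == 1 else "#1(" + " ".join(tokens) + ")"


def concept_expr(term: str, synonyms: Dict[str, List[str]], field: str) -> str:
    """Wrap a term and its synonyms (if any) in #syn so they count as one concept."""
    variants = [term] + synonyms.get(term, [])
    if len(variants) == 1:
        return term_expr(term, field)
    return "#syn(" + " ".join(term_expr(v, field) for v in variants) + ")"


def build_query(terms: List[str],
                synonyms: Dict[str, List[str]],
                field_weights: Dict[str, float]) -> str:
    """Build one #combine per field and mix the fields with #weight.

    field_weights is a hypothetical setting; the weights used in the
    paper are not given in this section.
    """
    parts = []
    for field, weight in field_weights.items():
        body = " ".join(concept_expr(t, synonyms, field) for t in terms)
        parts.append(f"{weight} #combine({body})")
    return "#weight(" + " ".join(parts) + ")"


if __name__ == "__main__":
    # Illustrative synonym table; in the paper synonyms come from PubChem.
    example_synonyms = {"aspirin": ["acetylsalicylic acid"]}
    print(build_query(["aspirin", "synthesis"],
                      example_synonyms,
                      {"title": 2.0, "claims": 1.0}))
```

Grouping synonyms under `#syn` makes the engine treat all variants as occurrences of a single term, so adding many synonyms for one chemical does not outweigh the other query terms.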
The test corpus used in this year’s Chemical IR Track consists of 1,185,012 patent files from the chemical domain (classified under the IPC codes C and A61K) and covers patents in the field up to 2007 registered at the EPO, USPTO and WIPO (three major patent offices). The patents are in XML format, are provided by the IRF², and contain title and claims fields along with description or abstract fields. The total uncompressed size of the patent files is 98.22 GB. Along with the chemical patent files, 59,000 chemical journal articles (also in XML format) are provided by the Royal Society of Chemistry³, UK. The set of scientific articles is approximately 3 GB in size. Both the patent files and the scientific articles are used for the technology survey task, whereas only the patent files are used for the prior art search task.

Domain-specific information retrieval (IR) has recently been attracting more attention as important progress has been made in IR in terms of theoretical models and evaluation. In addition to the Genomics and Legal tracks, the Chemical IR Track has become another domain-specific track of TREC; it addresses the challenges of chemical IR in general and of chemical patent IR in particular. Although chemical IR can benefit from existing research in general-purpose IR, it has distinct features that can be exploited. The first of these features is the structural information in the patents and articles. With a few exceptions [7], most prior research on prior art search used the words from the claims field as the search query without examining other alternatives [2,3,4,6]. Although the claims field is very important, other fields should also be carefully taken into account when selecting the terms for transforming patents into search queries in prior art search.
In the same way, there is very limited research that considers searching the queries in specific fields, such as the abstract, rather than in whole documents [3]. Constructing a structured query by selecting query terms from various fields of documents, and searching with the constructed query over different fields of documents, is the approach used for both the technology survey and prior art search tasks in this work. The second distinct feature of chemical documents in general is that chemical

¹ http://www.lemurproject.org/indri/
² http://www.ir-facility.org/
³ http://www.rsc.org/
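The post-processing step outlined in the abstract (date filtering followed by IPC-based re-ranking) could look roughly like the Python sketch below. Everything specific here is an assumption made for illustration, not the authors' method: the `Patent` record, the rule that a candidate must be filed strictly before the query patent, the use of Jaccard overlap as the IPC similarity, and the interpolation weight `alpha`. Retrieval scores are assumed to be normalized to [0, 1] beforehand.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List


@dataclass(frozen=True)
class Patent:
    """Hypothetical record for a query patent or a retrieved patent."""
    doc_id: str
    filed: date
    ipc_codes: FrozenSet[str]
    score: float = 0.0  # retrieval score, assumed normalized to [0, 1]


def ipc_similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap of two IPC code sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def post_process(query: Patent, ranked: List[Patent], alpha: float = 0.8) -> List[Patent]:
    """Drop candidates that violate the date constraint, then re-rank.

    Assumptions: prior art must be filed strictly before the query patent,
    and the final score interpolates the retrieval score with the IPC
    similarity using the hypothetical weight alpha.
    """
    kept = [p for p in ranked if p.filed < query.filed]

    def combined(p: Patent) -> float:
        return alpha * p.score + (1.0 - alpha) * ipc_similarity(query.ipc_codes, p.ipc_codes)

    return sorted(kept, key=combined, reverse=True)
```

Filtering before re-ranking keeps documents that cannot be prior art from occupying positions in the final list, whatever their IPC overlap with the query patent.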





Publication date: 2009